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Abstract 

Dependencies between loop iterations cannot always be charac- 
terized during program compilation. Doacross loops typically make 
use of a-priori knowledge of inter-iteration dependencies to carry out 
required synchronizations. We propose a type of doacross loop that 
allows us to schedule iterations of a loop among processors without 
advance knowledge of inter-iteration dependencies. The method pro- 
posed for loop iterations requires us to carry out parallelizable pre- 
processing and postprocessing steps during program execution. 


*This work was supported by NASA grant NAS1-18605 while the authors were in 
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1 Introduction 


Dependencies between loop iterations cannot always be characterized during 
program compilation. This inability to characterize dependencies can inhibit 
exploitation of potential parallelism if one is restricted to usual types of 
parallel loop constructs, i.e. doall or doacross loops [3] [2]. Doall loops do 
not impose any ordering on loop iterations while doacross loops impose a 
partial execution order in the sense that some of the iterations are forced 
to wait for the partial or complete execution of some previous iterations. 
Typically, doacross loops make use of a-priori knowledge of inter-iteration 
dependencies to carry out required synchronizations. 

The method we outline here is a variant of a doacross loop that allows 
us to schedule iterations of a loop onto processors in the absence of prior 
knowledge about inter-iteration dependencies. We call this type of doacross 
loop the preprocessed doacross. 

We use symbolic transformations to produce from a given loop: (1) inspec- 
tor procedures that perform execution time preprocessing, and (2) executors 
or transformed versions of source code loop structures. These transformed 
loop structures carry out the calculations planned in the inspector procedures. 
Characterizing the cost of execution time preprocessing is a critical aspect of 
this research. One requirement is that the execution time preprocessing itself 
be parallelizable. The preprocessing required for the preprocessed doacross 
loop is fully parallelizable. 

In Section 2, we describe the preprocessed doacross parallel construct, 
and in Section 3 we present results from two sets of experiments designed to 
characterize the performance tradeoffs manifest by using this construct. 

2 The Preprocessed Doacross Loop 

2.1 Overview 

A doacross loop is frequently used when one needs to parallelize loops with 
non-independent loop iterations. Typically it is necessary, before executing 
the loop, to know the distances of dependencies between statements in dif- 
ferent loop iterations. It is possible to carry out a simple form of execution 
time preprocessing that eliminates the need to know dependency distances. 


2 


do i=l,N 

y(a(i)) = yCb(i)) •••• 

end do 


Figure 1: Loop with Execution Time Determined Dependencies 


parallel do i-l,N 

S 1 : while (ready (b ( i) ) . eq . NOTDONE) 

endwhile 

<32: y(i) = y(b(i)) 

S3: ready (i) = DONE 

end parallel do 


Figure 2: Parallelized Loop with True Dependencies 


In Figure 1, we present a code fragment that wiU be used to demonstrate 
the structure of the inspector and executor loops in a simplified preprocessed 
doacross loop. We assume that there are no output dependences between 
left hand side array references; in Figure 1 this means that no wo e emen s 

of array a have the same value. ... 

We first assume that all dependencies are true dependencies, i.e aUJ 
= i and b(i) < i. As we show in Figure 2, we can use a shared array 
ready to make certain that the data dependencies are satisfied. Before the 
loop executes, ready is initialized to NOTDONE; when a new array element 
y(i) is calculated, we set readyfi) - DONE (statement S3). When yCbU)) 
fs required to satisfy a dependence in Figure 2, a busy wait is earned out 

(Statement SI) until y(b(i)) has been calculated , . be - 

In the case when some of the b(i) > i, the dependence relation ° e 
tween loop iterations are in fact antidependencies. To accommodate these 
antidependencies, we transform the loop in Figure 1 so that du^ng the course 
of the computation, all writes to y in Figure 1 are transformed into writes to 
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a new array ynew. A reference to y(b(i)) in Figure 1 may or may not have 
already been written to during an earlier loop iteration. When b(i) < i 
we use ynew(b(i)) in the right hand side of the transformed loop and when 
b(i) > i we use y(b(i)). In many cases it will be necessary to copy the 
newly computed elements of ynew back into y after the computation in the 
loop is done. 

If we do not assume that a(i) is equal to i, the order in which elements 
of y are written in the sequential loop (Figure 1) is determined by integer 
array a. When a right hand side array element y(b(i)) needs to be accessed, 
we will need to determine whether we should use an old or an updated value 
^ ^ y(b(i) ) in Figure 1 is written to during an earlier loop iteration j 

l we use y(b(i)) in the transformed code, otherwise we use ynew(b(i)). 
An array iter can be initialized during a preprocessing phase, so that 

• the value i is stored in iter (a (i)) 

• all other elements of iter(a(i) ) are set equal to a large integer (MAXINT). 

If iter(a(i)) < i for some iteration i of the transformed loop, a true de- 
pendency involving y exists and we use y(b(i)). Alternately, if iter(a(i)) 

> i we use ynew(b(i)). 

In order to limit the cost of initialization and the use of memory associated 
with this implementation of the doacross construct, we reuse the same arrays 
iter and ready for multiple preprocessed doacross loops. A (parallelized) 
postprocessing phase can be carried out after the loop is finished during 
which iter(a(i)) is set equal to MAXINT and ready(a(i)) is set equal to 
NOTDONE. Figure 3 serves to summarize pre and postprocessing required for 
the preprocessed doacross loop. 


2.2 A More Complex Example 

In this section, we will examine in some detail how the doacross transforma- 
tions would be carried out in a slightly more complex case. Following this 
exposition, experimental results obtained from this example will be presented 
in Section 3. 

In the loop SI in Figure 4, up to M+l separate elements of y are read (otice 
that the inner loop goes from 1 to M). The right hand side elements of y in 
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Preprocessing 

parallel do i=l,N 
iter(a(i)) = i 
end parallel do 

Postprocessing 

parallel do i=l,N 

iter(a(i)) = MAXINT 
ready (a(i)) = NOTDONE 
yold(a(i) ) = ynew(a(i)) 
end parallel do 


Figure 3: Pre and Postprocessing Steps 


iteration i may or may not have dependency relations with any loop itera- 
tion of SI (including iteration i itself). Any dependency can be either a true 
dependency or an antidependency. Figure 5 depicts a transformed version of 
the loop shown in Figure 4. As was described in Section 2.1, iter (a(i) ) is 
set to i before the parallelized loop is executed. When iter(b(i)+nbrs(j)) 
is less than or equal to i, we use ynew, the newly computed value of y (state- 
ments S5 and S8). Note that when iter(b(i)+nbrs(j)) is strictly less than 
i (statement S3), it is necessary to make sure that the true dependency is sat- 
isfied. When iter(b(i)+nbrs(j)) is equal to i, wedo not busy wait because 
the dependency is within iteration i. Finally, when iter(b(i)+nbrs(j)) 
is greater than i, either a reference to y from some later loop iteration is 
related to y(b(i)+nbrs(j)) by an antidependency relation or alternately, 
y(b(i)+nbrs(j)) is not written to anywhere in the loop nest. In either 
case, we use the old value of y and do not busy wait (statement S7). After 
the parallelized loop is completed, postprocessing analogous to that depicted 
in Figure 3 is carried out. 
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SI do i=l,N 
do j = l ,M 

yCa(i)) = y(a(i) ) + val(j)*y(b(i) + nbrs(j)) 
end do 
end do 


Figure 4: Preprocessed Doacross Test Loop 


51 parallel do i=L,N 

52 ynew(a(i) ) = y(a(i)) 

' do j=l,M 

offset = b(i) + nbrs(j) 
check = iter(offset) - i 
if (check. It. 0) then 

while (ready (off set) . ne . DONE) 
endwhile 

ynew(a(i) ) = ynew(a(i)) + val(j)*ynew (offset) 
else if (check. gt.O) 

ynew(a(i) ) * ynew(a(i)) + vals(j)*y(offset) 

else 

ynew(a(i) ) = ynew(a(i)) + vals(j)*ynew(offset) 

endif 
end do 

ready(a(i)) = DONE 
end parallel do 
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Figure 5: Parallelized Preprocessed Doacross Test Loop 
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2.3 Further Variants 

The transformations we have described in this paper utilize several arrays to 
schedule the iterations in parallel. These arrays will typically be the size of 
the index set, resulting in large utilization of memory. There are a number of 
ways in which the memory used by the preprocessed doacross can be reduced. 
It is possible to transform the original loop L into a pair of nested loops Li nner 
and iouter- The inner loop L inner would range over contiguous iterations of 
the original loop L. Loop L inner would be parallelized using the preprocessed 
doacross methods described above; loop Loiter would be carried out in a 
sequential manner. Preprocessing and postprocessing involving arrays ready, 
iter , ynew , and yold is carried out before and after each set of Linner 
iterations. This transformation reduces memory requirements because during 
each iteration of L^ur we can reuse ready and iter. 

When the left hand side arrays are indexed by a linear subscript function 
(i.e. a(i) is replaced by some known linear function cxi + d). it is possible 
to eliminate the execution time preprocessing phase along with the need to 
allocate storage for array iter. For the loop depicted in Figure 4, we can 
determine whether y(b(i) + nbrs(j)) can be written to by testing to see 
whether (b(i) + nbrs(j) - d mod c) is equal to 0. If a write is carried out 
it occurs during loop iteration (b(i) + nbrs(j) — d)/c. 


3 Performance of Preprocessed Doacross 

In this section, we provide experimental results for the performance of the 
inspectors and executors described in Section 2. The following timings were 
done on an Encore Multimax/320 with 13 megahertz APC/02 boards and 
version 2.1 of the FORTRAN compiler. Parallel efficiency is defined as 
T,eq/{p * Tpar), where T, eq is the time required to solve a problem using an 
optimized sequential version, is the time required on the same problem 
using a parallel code on p processors. 

3.1 Preprocessed Test Loop 

In this section we report on some experiments to characterize the performance 
of the preprocessed doacross loop construct. We consider Figure 4, where we 
have initialized arrays nbrs and a, such that nbrs(j) = 2j-L, and a(i) 
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= 2i. We parallelize this loop using the pre-processed doacross construct. 
In the data presented below, we assess the costs of the preprocessing and 
postprocessing outlined in Section 2.1. So that we may be better able to 
interpret the test results, in Figure 5, we have chosen to initialize a using 
a simple linear left hand side array index subscript function. We use the 
transformations described in Section 2.2. 

In Figure 6 we depict parallel efficiencies on 16 processors obtained when 
we set N equal to 10000, M equal to either 1 or 5, and varied L from 1 to 
14. Recall that in loop SI in Figure 4, up to M + 1 separate elements of y 
are read. For odd numbered values of L, there are no dependencies between 
outer loop iterations. The efficiencies we see for those L values reflect the 
overheads of : 

1. performing the runtime preprocessing and postprocessing 

2. performing execution time dependency checks 

For M equal to 1 and 5 efficiencies observed for odd L values are approximately 
33% and 50% respectively. 

The efficiencies for even values of L increase monotonically for both values 
of M. This is understandable because as L increases, the number of outer loop 
iterations between dependencies also increases. 


3.2 Sparse Triangular Solves 

We now consider a slightly different test loop which is used to solve sparse 
triangular systems of equations. Many of the sparse triangular systems we 
use for model problems arise from incompletely factored matrices obtained 
from a variety of discretized partial differential equations. The solution of 
these sparse triangular systems accounts for a large fraction of the sequential 
execution time of linear solvers that use Krylov methods[l]. The data de- 
pendencies between the elements of y are determined by the values assigned 
to the data structure column during program execution. These dependencies 
inhibit the parallelization of the outer loop (statement Si, Figure 7). A de- 
scription of the structure of the triangular systems used in our experiments 
is found in [1], outlined in the appendix is a brief description of how these 
systems were generated. 
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SI do i=l,n 

y(i) = rhs(i) 
do j=low(i) ,high(i) 

y(i) = y(i) - a(j)*y(column(j)) 
end do 
end do 


Figure 7: A Sparse Triangular Solve 


Ta 


ble 1: Preprocessed Doacross Times for Sparse Triangular Matrices 


Test 

Problem 

Preprocessed 
Doacross 
Time (ms) 

Preprocessed Doacross 
Iterations Rearranged 
Time (ms) 

Sequential 
Time (ms) 

SPE2 

34 

21 

223 

SPE5 

45 

23 

241 

5-PT 

37 

19 

192 

1 7-PT 

84 

56 

616 

9-PT 

97 

58 

698 


The loop in Figure 7 was parallelized on 16 processors and the paral- 
lelized and sequential times for the test matrices examined are depicted in 
Table 1. The timings obtained corresponded to parallel efficiencies between 
0.32 to 0.46. A modified loop was produced by carrying out the loop itera- 
tions in a more advantageous order. This reordering of loop iterations leaves 
the inter-iteration dependencies unchanged but reduces the effects of these 
dependencies on performance. The mechanism for carrying out this iteration 
reordering is described in [4] and is called a Doconsider transformation. The 
resulting loop is parallelized using the preprocessed doacross mechanism and 
the results are presented below in Table 1. Parallel efficiencies depicted in 
that table range from 0.63 to 0.75. 
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4 Conclusion 


The preprocessed doacross loop is a type of doacross loop that allows us 
to schedule loop iterations onto processors without prior knowledge of inter- 
iteration dependencies. We have demonstrated that such a loop structure can 
allow parallelization of loops that would not otherwise be easily parallelized. 
The overheads required to parallelize loops in this manner can be substantial 
but should not prevent us from achieving overall performance gains in many 
cases. 


5 Appendix: Definition of Test Triangular 
Systems 

The the triangular systems referred to in Section 3.2 were derived from the 

following partial differential equation discretizations: 

SPE2 This problem arises from the thermal simulation of a steam injection 
processes. The grid is 6x6x5 with 6 unknowns per grid point, this 
yields a system with 1080 equations. The matrix is a block seven point 
operator with 6x6 blocks. 

SPE5 This problem arises from a fully- implicit, simultaneous solution sim- 
ulation of a black oil model. It is a block seven point operator on a 
16x23x3 grid with 3x3 blocks yielding 3312 equations. 

5-PT The problem is a five point central difference discretization on a 63 x 
63 grid; this yields a system with 3969 equations. 

7-PT The problem is a seven point central difference discretization on a 20 
x 20 x 20 grid; this yields a system with 8000 equations. 

9-PT The problem is a nine point box scheme discretization on a 63 x 63 
grid; this yields a system with 3969 equations. 
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